Class Imbalance

The harms of class imbalance corrections for machine learning based prediction models: a simulation study

Thomas Reinke

Baylor University

Theophilus A. Bediako

Baylor University

August 10, 2025

Contents

  1. Introduction
  2. Methods
  3. Results
  4. Case Study
  5. Discussion
  6. Conclusion
  7. References

Original Paper

The harms of class imbalance corrections for machine learning based prediction models: a simulation study (Carriero et al. 2024)

Introduction

Introduction

  • Risk prediction models are increasingly vital in healthcare
    • Can help determine an individual’s risk of disease
  • Data used to train these models often suffer from class imbalance, where one class (e.g., patients with a rare disease) is much smaller than the other.
  • Researchers commonly apply imbalance corrections (e.g., over- or under-sampling) to artificially balance the dataset.
  • However, the effect of these corrections on the calibration of modern machine learning models is not well understood
    • Model calibration captures the agreement between the estimated (predicted) and observed number of events
    • A poorly calibrated model over- or under-estimates true risks
    • This can lead to poor treatment decisions
  • This study examines the impact of imbalance corrections on the performance, especially calibration, of several machine learning algorithms.

Methods

Methods

  • Conducted a simulation study to investigate the effects of imbalance correction methods across 18 unique data-generating scenarios
  • Focused on prediction models for dichotomous risk prediction
  • Compared the predictive performance of models developed with imbalance-corrected data to models developed without correction

Data Generating Scenarios

  • 2000 datasets per scenario
  • All scenarios were designed to produce data with an expected concordance statistic (C-statistic) of 0.85

Data Generating Mechanism

  • Data for the two classes (events and non-events) were generated from distinct multivariate normal distributions.

\[\text{Class 0:} \; \mathbf{X} \sim MVN(\mu_{0}, \Sigma_{0}) = MVN(\mathbf{0}, \Sigma_{0})\] \[ \text{Class 1:} \; \mathbf{X} \sim MVN(\mu_{1}, \Sigma_{1}) = MVN(\Delta_{\mu}, \Sigma_{0} - \Delta_{\Sigma}) \] For 8 predictors, the mean and covariance structure for class 0 was: \[ \mu_0 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \Sigma_0 = \begin{bmatrix} 1 & 0.2 & 0.2 & 0.2 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 1 & 0.2 & 0.2 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 1 & 0.2 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 0.2 & 1 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 0.2 & 0.2 & 1 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 0.2 & 0.2 & 0.2 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}. \]

Data Generating Mechanism

The mean and covariance structure for class 1 was: \[ \mu_1 = \begin{bmatrix} \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1-\delta_\Sigma & z & z & z & z & z & 0 & 0 \\ z & 1-\delta_\Sigma & z & z & z & z & 0 & 0 \\ z & z & 1-\delta_\Sigma & z & z & z & 0 & 0 \\ z & z & z & 1-\delta_\Sigma & z & z & 0 & 0 \\ z & z & z & z & 1-\delta_\Sigma & z & 0 & 0 \\ z & z & z & z & z & 1-\delta_\Sigma & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1-\delta_\Sigma & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-\delta_\Sigma \end{bmatrix}. \]

  • \(z = 0.2(1 - \delta_{\Sigma})\), ensuring the two classes have equivalent correlation matrices
  • The parameters \(\delta_{\mu}\) and \(\delta_{\Sigma}\) for each scenario were selected to yield an expected C-statistic of 0.85, providing a stable baseline for comparison
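
The two-class sampling scheme above can be sketched in a few lines. This is an illustrative Python translation (the original study was implemented in R); the function name `simulate_dataset` and the particular \(\delta_\mu\), \(\delta_\Sigma\) values used in the test are my own placeholders, not the paper's calibrated values.

```python
import numpy as np

def make_sigma0(p=8, n_corr=6, rho=0.2):
    """Class-0 covariance: first n_corr predictors pairwise correlated at rho."""
    sigma = np.eye(p)
    sigma[:n_corr, :n_corr] = rho
    np.fill_diagonal(sigma, 1.0)
    return sigma

def simulate_dataset(n, event_frac, delta_mu, delta_sigma, p=8, rng=None):
    """Draw events (class 1) and non-events (class 0) from the two MVNs."""
    rng = np.random.default_rng(rng)
    n1 = int(round(n * event_frac))   # events
    n0 = n - n1                       # non-events
    sigma0 = make_sigma0(p)
    # Class 1: means shifted by delta_mu, variances shrunk to 1 - delta_sigma,
    # off-diagonals z = 0.2 * (1 - delta_sigma) keep the correlations equal.
    mu1 = np.full(p, delta_mu)
    sigma1 = sigma0.copy()
    np.fill_diagonal(sigma1, 1.0 - delta_sigma)
    sigma1[:6, :6] = np.where(np.eye(6, dtype=bool),
                              1.0 - delta_sigma,
                              0.2 * (1.0 - delta_sigma))
    x0 = rng.multivariate_normal(np.zeros(p), sigma0, size=n0)
    x1 = rng.multivariate_normal(mu1, sigma1, size=n1)
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n0), np.ones(n1)])
    return X, y
```
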

Data Generating Mechanism

\[ C = \Phi \left(\sqrt{\Delta'_\mu ( \Sigma_0 + \Sigma_1)^{-1} \Delta_\mu} \right) \]

  • For a dichotomous outcome, the C-statistic is equivalent to the area under the ROC curve (AUC)
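
The closed-form C-statistic above can be evaluated directly, which is how one can search for \(\delta_\mu\) and \(\delta_\Sigma\) values that hit the target of 0.85. A minimal sketch (illustrative Python rather than the study's R code; the helper names are my own):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def theoretical_c(delta_mu_vec, sigma0, sigma1):
    """Expected C-statistic C = Phi(sqrt(d' (S0 + S1)^{-1} d))."""
    d = np.asarray(delta_mu_vec, dtype=float)
    m = np.linalg.solve(sigma0 + sigma1, d)  # (S0 + S1)^{-1} d
    return norm_cdf(np.sqrt(d @ m))
```

With no mean shift the formula returns 0.5 (no discrimination), and C increases monotonically with the size of the shift, so a one-dimensional root search over \(\delta_\mu\) suffices to reach 0.85.
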

Model Development

  • A two-step procedure was followed for every model:
    • Pre-process the training data with a class imbalance correction method
    • Train a machine learning algorithm on the resulting data.
  • Implemented a 5x6 full-factorial design to compare predictive performance
    • 5 imbalance-handling strategies (1 control + 4 corrections) and 6 machine learning algorithms
    • 30 unique models were developed and compared in each of the 18 scenarios
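
The full-factorial design amounts to crossing the two factors. A trivial sketch (illustrative Python; the study itself was run in R):

```python
from itertools import product

# The 5 imbalance-handling strategies and 6 algorithms compared in the paper.
corrections = ["control", "RUS", "ROS", "SMOTE", "SENN"]
algorithms = ["LR", "SVM", "RF", "XG", "RB", "EE"]

def build_pipelines():
    """Enumerate the 5 x 6 = 30 model-building pipelines of the design."""
    return [f"{corr}+{alg}" for corr, alg in product(corrections, algorithms)]
```

Each of the 30 pipelines is then trained and evaluated on every one of the 2000 datasets in each of the 18 scenarios.
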

Model Development - Imbalance Corrections

  • Five approaches to handling class imbalance were compared:
    • Control: no correction; the model was trained on the original, imbalanced data
    • Random Under-Sampling (RUS): randomly removes majority-class samples to achieve balance
    • Random Over-Sampling (ROS): randomly duplicates minority-class samples
    • SMOTE (Synthetic Minority Over-sampling Technique): creates synthetic minority-class samples by interpolating between existing ones
    • SENN (SMOTE + Edited Nearest Neighbors): applies SMOTE, then removes observations that are likely noise
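
RUS and ROS are simple enough to sketch directly; a minimal illustrative Python version is below (the function names are mine, not the paper's). SMOTE and SENN are more involved; in Python they are provided by the `imbalanced-learn` package (`SMOTE`, `SMOTEENN`).

```python
import numpy as np

def _split_classes(y):
    """Return (minority indices, majority indices) for a binary label vector."""
    idx1, idx0 = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    return (idx1, idx0) if len(idx1) < len(idx0) else (idx0, idx1)

def random_under_sample(X, y, rng=None):
    """RUS: randomly drop majority-class rows until the classes are balanced."""
    rng = np.random.default_rng(rng)
    minority, majority = _split_classes(y)
    keep = rng.choice(majority, size=len(minority), replace=False)
    sel = np.concatenate([minority, keep])
    return X[sel], y[sel]

def random_over_sample(X, y, rng=None):
    """ROS: randomly duplicate minority-class rows until the classes are balanced."""
    rng = np.random.default_rng(rng)
    minority, majority = _split_classes(y)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    sel = np.concatenate([majority, minority, extra])
    return X[sel], y[sel]
```
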

Model Development - Machine Learning Algorithms

  • Six machine learning algorithms, frequently used in clinical prediction, were evaluated:
    • Logistic Regression (LR)
    • Support Vector Machine (SVM)
    • Random Forest (RF)
    • XGBoost (XG)
  • Two of the six are ensemble algorithms designed specifically for imbalanced data:
    • RUSBoost (RB): boosting with random under-sampling in each iteration
    • EasyEnsemble (EE): bagging with random under-sampling

Simulation Methods

  • For each of the 18 scenarios, 2000 independent datasets were generated.
  • Each dataset was composed of a training set and a validation set that was 10 times larger to ensure stable performance evaluation.
  • Models were trained on the training data and their performance was assessed on the unseen validation data.

Simulation Methods

  • A logistic re-calibration step was also applied to all model predictions, to test whether post-hoc adjustment could repair any initial miscalibration
  • Re-calibration refits an intercept and slope on the logit scale: \( \text{logit}(p^{\text{recal}}) = \alpha + \beta \, \text{logit}(\hat{p}) \)
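
Logistic re-calibration is a standard technique: regress the observed outcome on the logit of the predicted risk. A minimal illustrative sketch in Python with scikit-learn (the study used R; `C=1e6` approximates an unpenalized fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_recalibrate(p_hat, y):
    """Refit intercept (alpha) and slope (beta) on the logit of predicted risks."""
    lp = np.log(p_hat / (1.0 - p_hat)).reshape(-1, 1)  # logit of predictions
    model = LogisticRegression(C=1e6, solver="lbfgs").fit(lp, y)
    alpha, beta = model.intercept_[0], model.coef_[0, 0]
    p_recal = model.predict_proba(lp)[:, 1]
    return p_recal, alpha, beta
```

Note that this transformation can shift the average predicted risk back toward the observed event rate, but it is a monotone map of the original predictions, which is one intuition for why it cannot fully undo the miscalibration the corrections introduce.
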

Performance Measures

  • Model performance was evaluated with three types of metrics:
    • Calibration: flexible calibration curves, calibration intercept (ideal = 0), and calibration slope (ideal = 1)
    • Discrimination: ability to separate events from non-events, measured by the C-statistic, equivalent to the AUC (ideal = 1)
    • Overall performance: Brier score, reflecting both calibration and discrimination (ideal = 0)
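
Two of these metrics are easy to compute from scratch; a minimal illustrative Python sketch (function names mine). The calibration intercept and slope are simply the \(\alpha\) and \(\beta\) coefficients from a logistic regression of the outcome on the logit of the predictions, so they are not repeated here.

```python
import numpy as np

def c_statistic(p, y):
    """Rank-based C-statistic (AUC); assumes continuous predictions without ties."""
    p, y = np.asarray(p), np.asarray(y)
    ranks = np.empty(len(p))
    ranks[np.argsort(p)] = np.arange(1, len(p) + 1)
    n1 = int(np.sum(y))
    n0 = len(y) - n1
    # Mann-Whitney U statistic, normalized to [0, 1]
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def brier_score(p, y):
    """Mean squared difference between predicted risk and observed outcome."""
    return float(np.mean((np.asarray(p) - np.asarray(y)) ** 2))
```
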

Software & Error Handling

  • All simulations were conducted in R on a high-performance computing (HPC) cluster
  • The caret package was used for systematic hyperparameter tuning
  • Error-handling protocol: if an imbalance correction or ML algorithm failed, the process continued where possible (e.g., using uncorrected data) and the failure was logged
  • Results are not exactly reproducible; the simulation was run twice and produced equivalent results
  • Imbalance-correction failures are summarized in Table S1 of the original paper

Results

Results

  • Primary finding: across all imbalanced scenarios, models developed without a correction showed equal or better calibration than models with corrections
  • Calibration: every correction, whether pre-processing (RUS, ROS, SMOTE, SENN) or specialized algorithm (RB, EE), systematically introduced miscalibration, consistently over-estimating risk
  • Discrimination: effects were inconsistent and algorithm-dependent; any benefits were small
  • Overall performance: the control models (trained on original, imbalanced data) consistently had the best (lowest) Brier scores
  • Re-calibration: adjusted the average predicted risk but could not fully repair the miscalibration (i.e., the calibration slope) introduced by the corrections

Results

  • Interactive exploration of the simulation results (embedded Shiny app)

Results

  • Detailed results for scenarios 4-6 (Table 4 of the original paper)

MIMIC-III Data Case Study

MIMIC-III Data Case Study

  • Goal: To test if the simulation findings hold true on a real-world, complex dataset.
  • Data: The MIMIC-III database was used to develop models predicting 90-day mortality for ICU patients. The dataset had a moderate event fraction of 0.17.
  • Methods: The exact same 30 model-building pipelines from the simulation were applied to the MIMIC-III data.
  • Findings:
    • The case study results strongly corroborated the simulation findings.
    • Every model that used an imbalance correction exhibited significant miscalibration, systematically overestimating the risk of mortality.
    • These models also had dramatically worse overall performance (Brier score) compared to their uncorrected counterparts.

Discussion

Discussion

  • This study provides strong evidence that for developing calibrated clinical prediction models, applying common imbalance corrections is often harmful.
  • The primary harm is a systematic overestimation of risk, which can lead to poor clinical decisions. This miscalibration is not easily fixed by post-hoc methods.
  • The potential small gains in discrimination from some corrections rarely outweigh the significant cost to calibration.
  • Standard ML algorithms (LR, SVM, RF, XG) are often surprisingly robust and can produce well-calibrated models when trained directly on imbalanced data.
  • Limitations: The study was confined to low-dimensional settings (8-16 predictors). Further research could explore higher dimensions.

Conclusion

Conclusion

  • Correcting for class imbalance is a widely used technique, but its negative impact on model calibration has been underappreciated.
  • When the goal is to produce reliable and accurate risk estimates for individual patients, applying imbalance corrections may do more harm than good.
  • Researchers and practitioners should be cautious and prioritize model calibration, questioning whether imbalance correction is truly necessary for their specific application.

References

References

Carriero, Alex, Kim Luijken, Anne de Hond, Karel GM Moons, Ben van Calster, and Maarten van Smeden. 2024. “The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation Study.” https://arxiv.org/abs/2404.19494.
Goorbergh, Ruben van den, Maarten van Smeden, Dirk Timmerman, and Ben Van Calster. 2022. “The Harm of Class Imbalance Corrections for Risk Prediction Models: Illustration and Simulation Using Logistic Regression.” https://arxiv.org/abs/2202.09101.